ABSTRACT

A matrix-multiplication algorithm on a linear array using an optimal number
of processing elements is proposed. The local storage required by the
processing elements and the I/O bandwidth required to drive the array are both
constants that are independent of the sizes of the matrices being multiplied.
The algorithm is therefore modular, that is, arbitrarily large matrices can be
multiplied on a large array built by cascading small arrays. The array is
well-suited for VLSI implementation.


The preparation of this report was supported by the U.S. Air Force Office of
Scientific Research under Contract F49620-83-C-0082.


1.  Introduction 

Specialized array processors have been proposed as a means of handling
compute-bound problems in a cost-effective and efficient manner [4,5,6]. These
array processors are typically made up of simple, identical processing elements
(which we will refer to as cells from now on) that operate in synchrony. Several
array structures have been proposed, including linear arrays, rectangular arrays
and hexagonal arrays. The simplicity and regularity of linear, rectangular and
hexagonal array processors render them suitable for VLSI implementation. High
performance is achieved by extensive use of pipelining and multiprocessing. In a
typical application, such arrays would be attached as peripheral devices to a
host computer which inserts input values into them and extracts output values
from them.

In practice, linear arrays are more attractive than rectangular or hexagonal
arrays for several reasons. Among them are the following: Linear arrays have
bounded I/O requirements [6]. In a wafer containing faulty cells, a large
percentage of non-faulty cells can be efficiently reconfigured into a linear
array with constant wire length between adjacent cells in the linear array [7].
Synchronization between cells in a linear array can be achieved by a simple
global clock whose rate is independent of the size of the array [2].

Linear-array algorithms for dense matrix multiplication have appeared in
[1,3,8]. These algorithms require $O(n)$ cells and $O(n^2)$ time steps to
multiply two $n \times n$ matrices. However, they also require each cell in the
linear array to have $O(n)$ words of local storage. Hence, the maximum storage
in the cells imposes an upper limit on the size of the matrices that can be
multiplied. Consequently, these matrix multiplication algorithms are not
modularly expandable, that is, matrices larger than $n \times n$ cannot be
multiplied by cascading several such linear arrays into one large array. To do
this, the local storage in each of the cells would have to be increased.

In this paper we present a novel linear-array algorithm for multiplying two
$n \times n$ dense matrices wherein the local storage required by each cell in
the linear array is a constant that is independent of the sizes of the matrices
being multiplied. Therefore the algorithm is modular, that is, arbitrarily large
matrices can be multiplied by extending the linear array. The algorithm requires
$O(n^2)$ cells and the multiplication is done in $O(n^2)$ steps. We will also
show that the $O(n^2)$ cells used by the algorithm are asymptotically optimal.
The time required to perform the multiplication, $O(n^2)$, is also
asymptotically optimal, as at least $n^2$ time steps are required to insert the
elements into and retrieve the results from the array through a constant number
of I/O ports.

The rest of this paper is organized as follows. In Section 2, we describe the
cell and the linear array model that we will be using to describe the algorithm.
In Section 3, we present the algorithm to multiply two $n \times n$ matrices and
illustrate it by an example. In Section 4, a proof of the algorithm is provided,
and in Section 5 we show that the $O(n^2)$ cells used by the algorithm are
optimal.

2.  Cell  and  Linear  Array  Model 

We begin with a description of the cell model. Each cell (see Figure 2.1) is
capable of performing a matrix multiplication step (i.e., a multiplication and
an addition) in every clock cycle.


Figure 2.1


IΦ and IΨ are the two control input ports, and OΦ and OΨ are the corresponding
control output ports. In every cycle, the control signal at IΨ is transmitted
unchanged to OΨ, and the control signal at IΦ is transmitted to OΦ through a
buffer BUF2 that delays it by one cycle. At every clock cycle, IΦ has one of the
following three control signals: $\Phi_1$, $\Phi_2$ and "don't-care". (A two-bit
wide IΦ is therefore adequate.) Similarly, at every clock cycle, IΨ has one of
the following three control signals: $\Psi_1$, $\Psi_2$ and "don't-care".

IA, IB and IC are the input data ports for the elements of the matrices A, B and
C respectively, where $C = A \times B$. The input data value at port IA is
accompanied by a tag bit. We will denote the input data value at port IA as
active if the tag bit is "on"; otherwise we will refer to it as being inactive.
In every clock cycle the DEC unit (read as "decoding unit") strips the tag from
the input value at IA. T denotes the tag bit and D the data.

The "dashed" lines are the control signals from the control unit to the adder,
multiplier and the MOD unit (read as "modifying unit"). In every clock cycle,
the MOD unit modifies the tag bit of the input value at IA depending on the
control signal from the control unit. The modified tag bit from MOD is appended
to the data at D in the ENC unit (read as "encoding unit").

BUF1 and BUF2 are two buffers whose sole purpose is to delay the input data at
IB and the input control signal at IΦ respectively by one cycle.

We now describe the program executed by the cell in every cycle. At the
beginning of a cycle, let a, b, c denote the data at ports IA, IB and IC
respectively. Let t_a denote the tag bit accompanying a. Let c1 and c2 be the
two input control signals at IΦ and IΨ respectively. The cell executes the
following steps sequentially.


    insert contents of BUF1 and BUF2 into output ports OB and OΦ respectively;
    if c1 = Φ1 and c2 = Ψ1
    then begin
        set t_a to "on" (i.e., activate a);
        go to exit1;
    end;
    if c1 = Φ2 and c2 = Ψ2
    then begin
        set t_a to "off" (i.e., deactivate a);
        go to exit1;
    end;
    if a is inactive then go to exit1;
    if a is active then go to exit2;
    exit1: insert c in output port OC;
           go to exit;
    exit2: insert c + ab in output port OC;
    exit:  insert output of ENC in output port OA;
           insert b into BUF1; insert c1 and c2 into BUF2 and OΨ respectively.

In every cycle, a cell either activates the data at IA, or deactivates the data
at IA, or computes a matrix multiplication step provided the data at IA is
active. The cell does not modify the tag bit when the control signals are
"don't-care" control signals.
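The per-cycle program of a cell can be sketched in Python as follows. This is only an illustrative model, not the authors' hardware description: the names `CellState`, `cell_step`, the string constants for the control signals, and the use of `None` for "don't-care" are all assumptions made for the sketch.

```python
from typing import NamedTuple, Optional

# Hypothetical encodings of the control signals; None stands for "don't-care".
PHI1, PHI2, PSI1, PSI2 = "phi1", "phi2", "psi1", "psi2"

class CellState(NamedTuple):
    buf1: float            # B value delayed by one cycle (BUF1)
    buf2: Optional[str]    # control signal from the Phi port delayed by one cycle (BUF2)

def cell_step(state, a, tag, b, c, c1, c2):
    """One clock cycle of a cell. Returns (new_state, outputs)."""
    ob, ophi = state.buf1, state.buf2        # buffered values leave the cell first
    if c1 == PHI1 and c2 == PSI1:
        tag, oc = True, c                    # activate a, pass c through unchanged
    elif c1 == PHI2 and c2 == PSI2:
        tag, oc = False, c                   # deactivate a, pass c through unchanged
    elif tag:
        oc = c + a * b                       # matrix multiplication step
    else:
        oc = c                               # inactive a: c is not updated
    outputs = {"OA": (a, tag), "OB": ob, "OC": oc, "OPHI": ophi, "OPSI": c2}
    return CellState(buf1=b, buf2=c1), outputs
```

Note that in this model, as in the program above, the cell that activates an element does not multiply with it in the same cycle; the first multiplication happens in the next cell.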

The  linear  array  is  comprised  of  cells  indexed  from  1  to  m  where  m  depends  on  the 
size  of  the  matrices  being  multiplied.  Figure  2.2  illustrates  the  linear  array. 


Figure 2.2


For any cell i in the linear array, its output ports OΦ, OA and OB are connected
to the input ports IΦ, IA and IB respectively of cell i+1. Also, its output
ports OΨ and OC are connected to the input ports IΨ and IC respectively of cell
i-1.


External control signals are inserted at IΦ of cell 1 and at IΨ of cell m. The
entries of matrix A are inserted at IA of cell 1. The tags accompanying each of
these entries are set to "off" (i.e., the entries of matrix A are inactive when
they enter the array). The entries of matrix B are inserted at IB of cell 1 and
those of matrix C are inserted at IC of cell m.

3.  Modular  Matrix  Multiplication  Algorithm 

We introduce the following notation to describe the algorithm. Let $a_{ij}$,
$b_{ij}$ and $c_{ij}$ denote the $(i,j)$th entry in matrices A, B and C
respectively. Elements $a_{ij}$ and $a_{pq}$ in matrix A are said to be in the
same diagonal if $i+j=p+q$. The $k$th diagonal denotes the diagonal containing
$a_{ij}$ where $i+j-1=k$.

The entries of matrix A are inserted in the following order: entries in the 1st
diagonal, followed by entries in the 2nd diagonal, and so on, up to the entries
in the $(2n-1)$st diagonal. Within any diagonal, the entries are inserted in
increasing order of their column indices.

The entries of matrices B and C are inserted in the following order: entries in
row 1, entries in row 2, ..., entries in row n. Within any row of matrix B the
entries are inserted in decreasing order of their column indices. Within any row
of matrix C the entries are inserted in increasing order of their column
indices.
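The three insertion orders just described can be written down as short Python helpers. The function names are illustrative only; indices are 1-based to match the text.

```python
def a_order(n):
    """Index pairs (i, j) of A in insertion order: by diagonal k = i + j - 1,
    and inside a diagonal by increasing column index j."""
    pairs = [(i, j) for i in range(1, n + 1) for j in range(1, n + 1)]
    return sorted(pairs, key=lambda ij: (ij[0] + ij[1] - 1, ij[1]))

def b_order(n):
    """Rows 1..n of B, each row in decreasing order of column index."""
    return [(i, j) for i in range(1, n + 1) for j in range(n, 0, -1)]

def c_order(n):
    """Rows 1..n of C, each row in increasing order of column index."""
    return [(i, j) for i in range(1, n + 1) for j in range(1, n + 1)]
```

For n = 3 the order for A starts with a11, then a21 and a12 (the 2nd diagonal), then a31, a22 and a13.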

Recall that control signals pass through a cell without any change. A control
signal at IΦ of a cell is transmitted unchanged to OΦ of the same cell at the
end of two cycles, and a control signal at IΨ of a cell is transmitted unchanged
to OΨ at the end of one cycle. At each clock cycle a new control signal (either
$\Phi_1$, $\Phi_2$, or "don't-care") is inserted at IΦ of cell 1. In the
sequence of control signals inserted at IΦ, let $\Phi_1^j$ ($\Phi_2^j$) denote
the $j$th $\Phi_1$ ($\Phi_2$) signal (we assume that the indexing begins from
1). Similarly, in the sequence of control signals inserted at IΨ, let $\Psi_1^i$
($\Psi_2^i$) denote the $i$th $\Psi_1$ ($\Psi_2$) signal.

The number m of cells required by the algorithm is dependent on whether n is
odd or even. Define r as follows: if n is odd, let r be $(n-1)/2$, and if n is
even, let r be [numerator illegible]/2 (we assume $n > 2$). Let $t_0$ denote
the time at which $\Psi_1^1$ is inserted in the array.


Algorithm  (  for  odd  n  ) 

The number m of cells required by the algorithm for odd n is
$(n-1)(r+1)+n^2+2$, and the algorithm consists of the following steps.

1. Insert $a_{ij}$ into IA of cell 1 at time $t_0+2+(n-1)(n-r)+n(i+j-2)+(j-1)$;

2. Insert $b_{ij}$ into IB of cell 1 at time $t_0+1+3(r+1)(i-1)-(j-1)$;

3. Insert 0 into IC of cell m at time $t_0+2+3n(i-1)+2(j-1)$;

4. Insert $\Phi_1^j$ into IΦ of cell 1 at time $t_0+2+3(r+1)(j-1)$;

5. Insert $\Phi_2^j$ into IΦ of cell 1 at time $t_0-(n-1)+3(r+1)(j-1)$;

6. Insert $\Psi_1^i$ into IΨ of cell m at time $t_0+3n(i-1)$;

7. Insert $\Psi_2^i$ into IΨ of cell m at time $t_0+2n+2+3n(i-1)$;

8. For all cycles between $t_0-(n-1)(r+1)-n^2-1$ and $t_0+5n^2-2n+1$ do the
   following:

   a. if no entry of matrix A is being inserted into IA of cell 1 then insert 0;

   b. if no valid control signals are being inserted into IΦ and IΨ of cell 1
      and cell m respectively then insert "don't-care" control signals.
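The insertion times of steps 1 to 7 can be collected into one small Python helper. This is a sketch under two assumptions: r is taken to be (n-1)/2 for odd n, and the step-2 expression is read as t0 + 1 + 3(r+1)(i-1) - (j-1), which is the form consistent with the meeting cell derived in Lemma 4.3. The function name `odd_schedule` and the dictionary keys are hypothetical.

```python
def odd_schedule(n, t0=0):
    """Insertion times for odd n > 2, as functions of 1-based indices."""
    if n < 3 or n % 2 == 0:
        raise ValueError("this sketch covers odd n > 2 only")
    r = (n - 1) // 2
    return {
        "m": (n - 1) * (r + 1) + n * n + 2,                                       # number of cells
        "a": lambda i, j: t0 + 2 + (n - 1) * (n - r) + n * (i + j - 2) + (j - 1),  # step 1
        "b": lambda i, j: t0 + 1 + 3 * (r + 1) * (i - 1) - (j - 1),                # step 2
        "c": lambda i, j: t0 + 2 + 3 * n * (i - 1) + 2 * (j - 1),                  # step 3
        "phi1": lambda j: t0 + 2 + 3 * (r + 1) * (j - 1),                          # step 4
        "phi2": lambda j: t0 - (n - 1) + 3 * (r + 1) * (j - 1),                    # step 5
        "psi1": lambda i: t0 + 3 * n * (i - 1),                                    # step 6
        "psi2": lambda i: t0 + 2 * n + 2 + 3 * n * (i - 1),                        # step 7
    }
```

With n = 3 and t0 = 14 (a value inferred from the surviving table entries of the example rather than stated explicitly), `odd_schedule(3, 14)["phi2"](3)` evaluates to 24, matching the Table 4 entry quoted in the text.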

The number of cells required by the algorithm for even n is $3n(r+1)$. The
algorithm is similar to the algorithm for odd n except for steps 1 and 8. For
even n, $a_{ij}$ is inserted at time $t_0+1+(n-r)(n-1)+(i+j-2)(n-1)+(j-1)$ in
step 1, and step 8 is carried out between cycles $t_0-3n(r+1)-1$ and
$t_0+1+(n-1)(3n+r+5)$.

In Tables 1, 2 and 3 the entry in the $i$th row and $j$th column is the time at
which the $(i,j)$th element in matrices A, B and C respectively is inserted into
the array.

Entries in Table 4 indicate the times at which the control signals are
inserted. The entry 24 in the 3rd column of the 2nd row is the time at which
$\Phi_2^3$ is inserted into the port IΦ of cell 1.

Tables 5 and 6 give the times and the cells where $\Psi_1^i$ meets $\Phi_1^j$
and $\Psi_2^i$ meets $\Phi_2^j$ respectively.


            Φ1^1              Φ1^2              Φ1^3
Ψ1^1   <a11, 5, 24>
Ψ1^2   <a21, 8, 30>      <a22, 6, 32>
Ψ1^3   <a31, 11, 36>     <a32, 9, 38>      <a33, 7, 40>

Table 5

            Φ2^1              Φ2^2              Φ2^3
Ψ2^1   <a11, 9, 28>      <a12, 7, 30>      <a13, 5, 32>
Ψ2^2   <a21, 12, 34>     <a22, 10, 36>     <a23, 8, 38>
Ψ2^3   <a31, 15, 40>     <a32, 13, 42>     <a33, 11, 44>

Table 6

The entry in the $i$th row and $j$th column of Table 5 is a 3-tuple
<a_ij, x, y>, where x is the cell where $\Phi_1^j$ and $\Psi_1^i$ meet and y is
the time at which they meet at x. At the same time $a_{ij}$ also appears at the
port IA of x. Consequently $a_{ij}$ is activated in x at time y.

Similarly, the entry in the $i$th row and $j$th column of Table 6 gives the cell
and the time at which $a_{ij}$ gets deactivated.
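The cells and times in Tables 5 and 6 can be recomputed from the insertion formula for A and the cell indices p and q that appear in Lemma 4.2. The sketch below assumes odd n, r = (n-1)/2, and the example values n = 3 and t0 = 14; the latter is inferred from the surviving table entries, not stated in the text. The function name is hypothetical.

```python
def activation_windows(n, t0):
    """Map (i, j) -> ((p, time at p), (q, time at q)) for odd n."""
    r = (n - 1) // 2
    windows = {}
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            t_a = t0 + 2 + (n - 1) * (n - r) + n * (i + j - 2) + (j - 1)
            p = n * (i - 1) + (r + 1) * (n - j) + 1   # activation cell
            q = p + n + 1                             # deactivation cell
            # a_ij moves one cell per cycle, so it is at cell h at time t_a + h - 1
            windows[(i, j)] = ((p, t_a + p - 1), (q, t_a + q - 1))
    return windows
```

For example, `activation_windows(3, 14)[(1, 1)]` gives `((5, 24), (9, 28))`, i.e. a11 is activated in cell 5 at time 24 and deactivated in cell 9 at time 28.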

We will trace the computation of $c_{12}$ as an illustration. The trace is
depicted in Table 7.

Consider any $i$th row in Table 7. The 1st column in the $i$th row is the time
at which $c_{12}$ appears at the input port IC of the cell whose index appears
in the 2nd column. The entries in the 3rd and 4th columns are the elements at
the cell's IA and IB ports at that time. For instance, the 9th row indicates
that $c_{12}$ appears at the input port IC of cell 7 at time 26, and $a_{11}$
and $b_{12}$ are the elements at the ports IA and IB respectively of cell 7 at
time 26.

The "starred" entries in the 3rd column are used to indicate that the
corresponding entries are active. For instance, $a_{11}$ is active when it
appears at the input port IA of cell 7 at time 26. On the other hand, $a_{31}$
is inactive when it appears at the port IA of cell 4 at time 29.

From Table 7 it can be seen that $c_{12}$ gets updated only in cells 7, 5 and 3.
In any other cell it is not updated, as it encounters either a 0 or an inactive
element of matrix A at the cell's IA port.
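The trace of c12 can be reproduced with a few lines of Python. The sketch walks c_ij from cell m down to cell 1 and reports the cells where the element at the IA port is active. It assumes odd n, r = (n-1)/2, t0 = 14 for the example (an inferred value), and it takes the active range p < y < q from Lemma 4.2 as given instead of simulating the control signals. The function name is hypothetical.

```python
def active_meetings(n, t0, i, j):
    """Cells (in visiting order m..1) where c_ij meets an active element of A.
    Returns a list of (cell, time, (row, column) of the A element)."""
    r = (n - 1) // 2
    m = (n - 1) * (r + 1) + n * n + 2
    t_c = t0 + 2 + 3 * n * (i - 1) + 2 * (j - 1)
    # insertion time of every a_us, keyed by time
    a_times = {t0 + 2 + (n - 1) * (n - r) + n * (u + s - 2) + (s - 1): (u, s)
               for u in range(1, n + 1) for s in range(1, n + 1)}
    hits = []
    for y in range(m, 0, -1):
        now = t_c + (m - y)                    # c_ij is at cell y at this time
        elem = a_times.get(now - (y - 1))      # element of A at IA of cell y, if any
        if elem is None:
            continue                           # a 0 is at the IA port
        u, s = elem
        p = n * (u - 1) + (r + 1) * (n - s) + 1
        if p < y < p + n + 1:                  # active range from Lemma 4.2
            hits.append((y, now, elem))
    return hits
```

For n = 3 and t0 = 14, `active_meetings(3, 14, 1, 2)` returns the three meetings in cells 7, 5 and 3, with a11, a12 and a13 respectively.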

4.  Proof  of  Correctness 

We now establish the correctness of the algorithm. We will only prove this for
odd n, as the proof for even n is similar. Let $c_{ij}$ denote the element 0
inserted at IC of cell m at time $t_0+2+3n(i-1)+2(j-1)$ in step 3 of the
algorithm.

We will say that the elements of the three matrices and the control signals
meet at a cell whenever they appear at the cell's input ports in the same cycle.
(For instance, $a_{is}$ and $b_{sj}$ meet at cell h if $a_{is}$ and $b_{sj}$
appear at the input ports IA and IB respectively of cell h in the same cycle.)

Each cell in the linear array has five I/O ports (three for inserting and
extracting elements of matrices A, B and C, and two for inserting and extracting
the $\Phi_1$, $\Phi_2$ and $\Psi_1$, $\Psi_2$ control signals). In the following
Lemma we show that these I/O ports are never "overloaded" by showing that
distinct elements can never appear simultaneously at the same input port of any
cell in the linear array.

Lemma 4.1: Distinct elements of matrices A, B and C do not simultaneously reach
the input ports IA, IB and IC respectively of any cell in the linear array.
Distinct $\Phi_1$, $\Phi_2$ control signals do not simultaneously reach the
input port IΦ of any cell, and distinct $\Psi_1$, $\Psi_2$ control signals also
do not simultaneously reach the input port IΨ of any cell.

Proof: We will show that distinct elements of matrix A do not simultaneously
reach the input port IA of any cell; the proof is similar for elements of matrix
B and matrix C as well as for the control signals $\Phi_1$, $\Phi_2$, $\Psi_1$
and $\Psi_2$.

Let $a_{ij}$ and $a_{pq}$ be two distinct elements that appear simultaneously at
the input port of cell s. The time at which $a_{ij}$ reaches the input port of s
is $[t_0+2+(n-1)(n-r)+n(i+j-2)+(j-1)] + \{s\}$. The expression within [ ] is the
time at which $a_{ij}$ is inserted into the array and the expression within { }
is the time taken by $a_{ij}$ to reach s after it is inserted. Similarly, the
time at which $a_{pq}$ reaches s is $[t_0+2+(n-1)(n-r)+n(p+q-2)+(q-1)] + \{s\}$.
Equating these two times and simplifying we obtain $(i-p+j-q) = (q-j)/n$. Now
the left-hand side is an integer, whereas the right-hand side is a proper
fraction whenever $j \ne q$, since then $0 < |j-q| \le n-1$. So for equality to
hold, $j=q$ and hence $i=p$. So $a_{ij}$ and $a_{pq}$ are not distinct as
assumed, a contradiction. □
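Because all elements of a given stream travel at the same speed, Lemma 4.1 for the matrix streams reduces to the insertion times on each port being pairwise distinct. The following brute-force check illustrates this for a few odd n. It assumes r = (n-1)/2 and reads the insertion time of b_ij as t0 + 1 + 3(r+1)(i-1) - (j-1); the function names are hypothetical.

```python
def insertion_times(n, t0=0):
    """Insertion times of all entries of A, B and C for odd n."""
    r = (n - 1) // 2
    idx = [(i, j) for i in range(1, n + 1) for j in range(1, n + 1)]
    return {
        "A": [t0 + 2 + (n - 1) * (n - r) + n * (i + j - 2) + (j - 1) for i, j in idx],
        "B": [t0 + 1 + 3 * (r + 1) * (i - 1) - (j - 1) for i, j in idx],
        "C": [t0 + 2 + 3 * n * (i - 1) + 2 * (j - 1) for i, j in idx],
    }

def collision_free(n):
    """True if no two entries of the same matrix are inserted in the same cycle."""
    return all(len(set(times)) == len(times) for times in insertion_times(n).values())
```

A check over small sizes, e.g. `all(collision_free(n) for n in (3, 5, 7, 9))`, is of course not a proof; it only illustrates the counting argument above.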

Recall that a cell performs a matrix multiplication step only if the element at
its IA port is active. Hence, for any $c_{ij}$ to be correctly updated it must
meet an active $a_{is}$ ($\forall s \mid 1 \le s \le n$). We next identify the
cells in which $a_{is}$ is active.

Lemma 4.2: Let $p = n(i-1)+(r+1)(n-s)+1$ and $q = n(i-1)+(r+1)(n-s)+n+2$. If
$a_{is}$ is active in a cell y then $p < y < q$.

Proof: $a_{is}$ is activated whenever it meets a $\Phi_1$ and a $\Psi_1$ control
signal simultaneously. Let h be the cell index where $a_{is}$ meets $\Phi_1^f$
and $\Psi_1^g$ simultaneously. Let $t(a_{is})$, $t(\Phi_1^f)$ and $t(\Psi_1^g)$
denote the times at which $a_{is}$, $\Phi_1^f$ and $\Psi_1^g$ respectively are
inserted into the array. Let $h(a_{is})$, $h(\Phi_1^f)$ and $h(\Psi_1^g)$ denote
the time taken by $a_{is}$, $\Phi_1^f$ and $\Psi_1^g$ respectively to reach h
after being inserted into the array. Now $a_{is}$, $\Phi_1^f$ and $\Psi_1^g$
meet at h. Hence
$t(a_{is})+h(a_{is}) = t(\Phi_1^f)+h(\Phi_1^f) = t(\Psi_1^g)+h(\Psi_1^g)$. From
the algorithm we obtain the following:

(1) $t(a_{is}) = t_0+2+(n-1)(n-r)+n(i+s-2)+(s-1)$ and $h(a_{is}) = h-1$.

(2) $t(\Phi_1^f) = t_0+2+3(r+1)(f-1)$ and $h(\Phi_1^f) = 2(h-1)$. (The
    multiplication factor 2 appears in $h(\Phi_1^f)$ as $\Phi$ control signals
    travel at a velocity of 1/2 a cell per clock cycle.)

(3) $t(\Psi_1^g) = t_0+3n(g-1)$ and $h(\Psi_1^g) = m-h$. (h is subtracted from m
    in $h(\Psi_1^g)$ as $\Psi_1$ control signals travel from cell m to cell 1.)

Now $t(\Phi_1^f)+h(\Phi_1^f) = t(\Psi_1^g)+h(\Psi_1^g)$, and so from (2) and (3)
we obtain $h = n(g-1)+(r+1)(n-f)+1$. Also
$t(a_{is})+h(a_{is}) = t(\Phi_1^f)+h(\Phi_1^f)$, and so from (1) and (2) we
obtain $(n-1)(n-r)+n(i-1)+(n+1)(s-1) = 3(r+1)(f-1)+h-1$, which on substituting
$h = n(g-1)+(r+1)(n-f)+1$ simplifies to $n(s-f-g+i) = f-s$. Since
$|f-s| \le n-1$, equality can hold only if $f=s$ and $g=i$. So
$\Phi_1^f = \Phi_1^s$, $\Psi_1^g = \Psi_1^i$ and $h=p$. So $a_{is}$ only meets
$\Phi_1^s$ and $\Psi_1^i$, and it meets them at cell p. Hence $a_{is}$ is
activated in cell p.

We can similarly show that $a_{is}$ only meets $\Phi_2^s$ and $\Psi_2^i$, that
it meets them at cell q, and hence that $a_{is}$ is deactivated in cell q.
Consequently, $a_{is}$ is active only in a cell y where $p < y < q$. □
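The step from (2) and (3) to the expression for h can be spelled out as follows. This is a worked derivation added for illustration; it uses the cell count $m=(n-1)(r+1)+n^2+2$ for odd n and assumes $n = 2r+1$, i.e. $r=(n-1)/2$.

```latex
\begin{align*}
t_0 + 2 + 3(r+1)(f-1) + 2(h-1) &= t_0 + 3n(g-1) + m - h \\
3h &= 3n(g-1) - 3(r+1)(f-1) + m \\
m &= (n-1)(r+1) + n^2 + 2 \\
  &= (n-1)(r+1) + (n-1)(n+1) + 3 \\
  &= 3(n-1)(r+1) + 3 \qquad \text{since } n+1 = 2(r+1) \\
h &= n(g-1) - (r+1)(f-1) + (n-1)(r+1) + 1 \\
  &= n(g-1) + (r+1)(n-f) + 1
\end{align*}
```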


Having identified the cells in which $a_{is}$ is active, we will now establish
that $c_{ij}$ always meets an active $a_{is}$ and $b_{sj}$
($\forall s \mid 1 \le s \le n$) in the same cell.


Lemma 4.3: Let $p = n(i-1)+(r+1)(n-s)+1$ and $x = p+j$. Then, for any i, j, s
($1 \le i,j,s \le n$),

1. $a_{is}$, $b_{sj}$ and $c_{ij}$ will only meet at cell x, and

2. $a_{is}$ is active then.

Proof: Let $a_{is}$, $b_{sj}$ and $c_{ij}$ meet at cell h. Let $t(a_{is})$,
$t(b_{sj})$ and $t(c_{ij})$ denote the times at which $a_{is}$, $b_{sj}$ and
$c_{ij}$ respectively are inserted into the array. Let $h(a_{is})$, $h(b_{sj})$
and $h(c_{ij})$ denote the time taken by $a_{is}$, $b_{sj}$ and $c_{ij}$
respectively to reach cell h after being inserted into the array. Equating
$t(a_{is})+h(a_{is})$ to $t(b_{sj})+h(b_{sj})$ we obtain
$h = x = n(i-1)+(r+1)(n-s)+1+j$.

Now $a_{is}$, $b_{sj}$ and $c_{ij}$ will pass through every cell indexed from 1
to m. We will first show that they pass through h by showing that
$1 \le h \le m$. The minimum value of h is 2, which is obtained when $i=j=1$ and
$s=n$. Clearly $2 > 1$ and hence $h > 1$. The maximum value of h is
$n^2+(n-1)(r+1)+1$, which is obtained when $i=j=n$ and $s=1$. Clearly
$n^2+(n-1)(r+1)+1 < m$ and hence $h < m$.

1. Hence $a_{is}$, $b_{sj}$ and $c_{ij}$ meet at cell x. Lastly, cell x is the
   only cell where they will meet, as $c_{ij}$ travels in a direction opposite
   to that of $a_{is}$ and $b_{sj}$.

2. That $a_{is}$ is active follows immediately from Lemma 4.2. □
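The meeting cell x of Lemma 4.3 can be checked numerically by comparing the arrival times of the three elements at x. The sketch assumes odd n with r = (n-1)/2, that A and C move one cell per cycle while B (delayed by BUF1) moves one cell every two cycles, and that b_ij is inserted at t0 + 1 + 3(r+1)(i-1) - (j-1). The function name is hypothetical.

```python
from itertools import product

def meeting_cells_ok(n, t0=0):
    """True if a_is, b_sj and c_ij arrive at cell x = p + j in the same cycle
    for every i, j, s in 1..n (odd n)."""
    r = (n - 1) // 2
    m = (n - 1) * (r + 1) + n * n + 2
    for i, j, s in product(range(1, n + 1), repeat=3):
        x = n * (i - 1) + (r + 1) * (n - s) + 1 + j
        at_a = t0 + 2 + (n - 1) * (n - r) + n * (i + s - 2) + (s - 1) + (x - 1)
        at_b = t0 + 1 + 3 * (r + 1) * (s - 1) - (j - 1) + 2 * (x - 1)
        at_c = t0 + 2 + 3 * n * (i - 1) + 2 * (j - 1) + (m - x)
        if not (1 <= x <= m and at_a == at_b == at_c):
            return False
    return True
```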

From Lemma 4.3 we can assert that every product $a_{is}b_{sj}$
($1 \le s \le n$) is added to $c_{ij}$. To assert that
$c_{ij} = \sum_{s=1}^{n} a_{is}b_{sj}$ we must ensure that if $c_{ij}$ does not
meet an active $a_{is}$ in a cell, then it encounters either an inactive element
of matrix A or the element 0 at the cell's IA port.

Lemma 4.4: If an active $a_{is}$ meets $c_{uv}$ then $u=i$.

Proof: Let $p = n(i-1)+(r+1)(n-s)+1$ and $q = n(i-1)+(r+1)(n-s)+n+2$. By Lemma
4.2, $a_{is}$ is active only in a cell y such that $p < y < q$.

Let $t(a_{is})$ and $t(c_{uv})$ respectively denote the times at which $a_{is}$
and $c_{uv}$ are inserted into the array. Let $p(a_{is})$ and $q(a_{is})$ denote
the times taken by $a_{is}$ to reach cell p and cell q respectively. Let
$y(c_{uv})$ denote the time taken by $c_{uv}$ to reach cell y after being
inserted into the array. $c_{uv}$ meets an active $a_{is}$, and hence
$t(a_{is})+p(a_{is}) < t(c_{uv})+y(c_{uv}) < t(a_{is})+q(a_{is})$.

Let $y = p+\Delta$. As $q-p = n+1$ and $p < y < q$, we have $0 < \Delta < n+1$.
Now $t(c_{uv}) = t_0+2+3n(u-1)+2(v-1)$ and $y(c_{uv}) = m-p-\Delta$. Since
$t(a_{is})+p(a_{is}) < t(c_{uv})+y(c_{uv})$ we obtain:

$\Delta < 3n(u-i)+2v$ ........ (a)

Also, as $t(c_{uv})+y(c_{uv}) < t(a_{is})+q(a_{is})$ we obtain:

$\Delta > 3n(u-i)+2v-n-1$ ........ (b)

$\Delta > 0$, and so by (a) $3n(u-i) > -2v$. For $u < i$ this inequality does
not hold, as the minimum value of $-2v$ is $-2n$ and the maximum value of
$3n(u-i)$ is $-3n$, attained when $u-i = -1$.

So $u \ge i$ ........ (c)

$\Delta < n+1$, and so by (b) $3n(u-i)+2v-n-1 < n+1$, which reduces to
$3n(u-i) < 2(n+1-v)$. For $u > i$ this inequality does not hold, as the maximum
value of $2(n+1-v)$ is $2n$, attained when $v=1$, while the minimum value of
$3n(u-i)$ is $3n$, attained when $u-i = 1$.

So $u \le i$ ........ (d)

From (c) and (d), $u=i$. □
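Lemma 4.4 can also be illustrated by exhaustive search for small odd n: enumerate every active position of every a_is and look for a c_uv from a different row arriving in the same cell in the same cycle. The sketch assumes r = (n-1)/2 and the active range p < y < q from Lemma 4.2; the function name is hypothetical.

```python
from itertools import product

def stray_meetings(n, t0=0):
    """All (i, s, u, v, y) where an active a_is meets c_uv in cell y with u != i.
    Lemma 4.4 states that this list is empty."""
    r = (n - 1) // 2
    m = (n - 1) * (r + 1) + n * n + 2
    bad = []
    for i, s, u, v in product(range(1, n + 1), repeat=4):
        if u == i:
            continue
        t_a = t0 + 2 + (n - 1) * (n - r) + n * (i + s - 2) + (s - 1)
        t_c = t0 + 2 + 3 * n * (u - 1) + 2 * (v - 1)
        p = n * (i - 1) + (r + 1) * (n - s) + 1
        for y in range(p + 1, p + n + 1):          # cells strictly between p and q
            if t_a + (y - 1) == t_c + (m - y):     # same cell, same cycle
                bad.append((i, s, u, v, y))
    return bad
```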

Lemma 4.5: In any cell y in the linear array and for any i, j
($1 \le i,j \le n$), $c_{ij}$ always encounters an element of matrix A or a 0 at
the IA port of cell y.

Proof: Let $t(c_{ij})$ denote the time when $c_{ij}$ is inserted into the array
and $y(c_{ij})$ denote the time taken to reach cell y after insertion. Now
$t(c_{ij}) = t_0+2+3n(i-1)+2(j-1)$ and $y(c_{ij}) = m-y$. The element
encountered by $c_{ij}$ at the IA port of cell y must have been inserted into
the array at cell 1 at time $z = t(c_{ij})+y(c_{ij})-y+1$. Recall from step 8 of
the algorithm that either the element 0 or an element of matrix A is inserted
into the array between cycles $t_0-(n-1)(r+1)-n^2-1$ and $t_0+5n^2-2n+1$. If we
show that $t_0-(n-1)(r+1)-n^2-1 \le z \le t_0+5n^2-2n+1$ then clearly the
element inserted into the array at the IA port of cell 1 is either the element
0 or an element of matrix A.

Now $z = t(c_{ij})+y(c_{ij})-y+1
       = t_0+2+3n(i-1)+2(j-1)+n^2+2+(n-1)(r+1)-2y+1$.

It can be easily seen from the expression above that z is minimum when i and j
are minimum and y is maximum. $i=1$ and $j=1$ are the minimum values for i and
j, and $y = m = (n-1)(r+1)+n^2+2$ is the maximum value of y. Similarly, z is
maximum when $i=n$, $j=n$ and $y=1$. Let $z_{max}$ and $z_{min}$ denote the
maximum and minimum z respectively. It can be easily shown that
$z_{min} \ge t_0-(n-1)(r+1)-n^2-1$ and $z_{max} \le t_0+5n^2-2n+1$. □
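The two bounds left to the reader can be worked out as follows. This derivation is an added illustration for odd n; it assumes $n = 2r+1$, so that $m = (n-1)(r+1)+n^2+2 = (3n^2+3)/2$.

```latex
\begin{align*}
z &= t_0 + 2 + 3n(i-1) + 2(j-1) + m - 2y + 1 \\[4pt]
z_{\min} &= t_0 + 3 + m - 2m \qquad (i=j=1,\; y=m) \\
         &= t_0 - (n-1)(r+1) - n^2 + 1 \\
         &\ge t_0 - (n-1)(r+1) - n^2 - 1 \\[4pt]
z_{\max} &= t_0 + 2 + 3n(n-1) + 2(n-1) + m - 1 \qquad (i=j=n,\; y=1) \\
         &= t_0 + 3n^2 - n - 1 + \tfrac{3n^2+3}{2} \\
         &= t_0 + \tfrac{9n^2 - 2n + 1}{2} \\[4pt]
\left(t_0 + 5n^2 - 2n + 1\right) - z_{\max} &= \tfrac{(n-1)^2}{2} \;\ge\; 0
\end{align*}
```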


We can now assert that $c_{ij}$ is correctly computed when it exits the array.


Theorem 4.1: For any i, j ($1 \le i,j \le n$), the value of $c_{ij}$ is
$\sum_{s=1}^{n} a_{is}b_{sj}$ when it exits the array.

Proof: By Lemma 4.5, $c_{ij}$ will either meet an element of matrix A or the
element 0 at any cell.

1. By Lemma 4.3 it will meet $a_{is}$ and $b_{sj}$ in the same cell.

2. By Lemma 4.4 if it meets an element $a_{us}$ of matrix A and $u \ne i$ then
   $a_{us}$ is inactive.

From (1) and (2) the Theorem follows. □
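As an empirical complement to the proof, the whole array can be modelled cycle by cycle for small odd n. The sketch below is not the authors' implementation: the function name, the integer encoding of the control signals (1 and 2 for the subscripted signals, 0 for "don't-care"), the labels carried with the C stream, and the reading of the step-2 insertion time as t0 + 1 + 3(r+1)(i-1) - (j-1) are all assumptions. Ports not driven by the schedule are fed 0 or "don't-care", as in step 8.

```python
def simulate_odd(A, B, t0=0):
    """Cycle-by-cycle model of the linear array for odd n >= 3; returns C."""
    n = len(A)
    if n < 3 or n % 2 == 0:
        raise ValueError("this sketch covers odd n >= 3 only")
    r = (n - 1) // 2
    m = (n - 1) * (r + 1) + n * n + 2
    a_in, b_in, c_in, phi_in, psi_in = {}, {}, {}, {}, {}
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            a_in[t0 + 2 + (n - 1) * (n - r) + n * (i + j - 2) + (j - 1)] = A[i - 1][j - 1]
            b_in[t0 + 1 + 3 * (r + 1) * (i - 1) - (j - 1)] = B[i - 1][j - 1]
            c_in[t0 + 2 + 3 * n * (i - 1) + 2 * (j - 1)] = (i, j)
    for k in range(1, n + 1):
        phi_in[t0 + 2 + 3 * (r + 1) * (k - 1)] = 1        # k-th Phi_1
        phi_in[t0 - (n - 1) + 3 * (r + 1) * (k - 1)] = 2  # k-th Phi_2
        psi_in[t0 + 3 * n * (k - 1)] = 1                  # k-th Psi_1
        psi_in[t0 + 2 * n + 2 + 3 * n * (k - 1)] = 2      # k-th Psi_2

    # Port contents for the current cycle; cells are indexed 1..m.
    ia = [(0, False)] * (m + 2)      # (value, tag)
    ib = [0] * (m + 2)
    ic = [(0, None)] * (m + 2)       # (value, label); label is (i, j) or None
    iphi, ipsi = [0] * (m + 2), [0] * (m + 2)
    buf1, buf2 = [0] * (m + 2), [0] * (m + 2)
    C = [[None] * n for _ in range(n)]

    for t in range(t0 - n, max(c_in) + m + 1):
        # External inputs at the two ends of the array.
        ia[1] = (a_in.get(t, 0), False)
        ib[1] = b_in.get(t, 0)
        iphi[1] = phi_in.get(t, 0)
        ic[m] = (0, c_in.get(t))
        ipsi[m] = psi_in.get(t, 0)
        nia, nib, nic = list(ia), list(ib), list(ic)
        niphi, nipsi = list(iphi), list(ipsi)
        for h in range(1, m + 1):
            (a, tag), b, (c, label) = ia[h], ib[h], ic[h]
            c1, c2 = iphi[h], ipsi[h]
            ob, ophi = buf1[h], buf2[h]          # buffered values leave first
            if c1 == 1 and c2 == 1:
                tag, oc = True, c                # activate
            elif c1 == 2 and c2 == 2:
                tag, oc = False, c               # deactivate
            elif tag:
                oc = c + a * b                   # multiplication step
            else:
                oc = c
            buf1[h], buf2[h] = b, c1
            nia[h + 1], nib[h + 1], niphi[h + 1] = (a, tag), ob, ophi
            nic[h - 1], nipsi[h - 1] = (oc, label), c2
            if h == 1 and label is not None:     # c_ij leaves the array at cell 1
                C[label[0] - 1][label[1] - 1] = oc
        ia, ib, ic, iphi, ipsi = nia, nib, nic, niphi, nipsi
    return C
```

Comparing the result with an ordinary triple-loop product for a few small matrices is a quick sanity check of the schedule; it does not replace the lemmas above.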


5.  Proof  of  Optimality 

We will now establish that the number of cells used by the modular linear-array
algorithm is asymptotically optimal. We establish this result under the
following assumptions:

1. Any special-purpose machine (like a linear array) that multiplies matrices A
   and B must compute $a_{ik}b_{kj}$
   ($\forall i, \forall j, \forall k \mid 1 \le i,j,k \le n$).

2. The special-purpose machine has a constant number of I/O ports.

3. The elements of the matrices A, B and C are inserted into the
   special-purpose machine only once through the input ports.

Under these assumptions we will establish that $\Omega(n^2)$ (see footnote 2) is
a lower bound on the storage that is required by any special-purpose machine
that multiplies two $n \times n$ matrices. We obtain this bound by formulating
the computation of matrix multiplication as a game played with tokens on an
undirected graph constructed as follows:

Let $G_k = (V_k, E_k)$, $k = 1,..,n$, where

$V_k = \{f_{ik}, h_{kj} \mid i = 1,..,n \text{ and } j = 1,..,n\}$ and
$E_k = \{\langle f_{ik}, h_{kj} \rangle \mid i = 1,..,n \text{ and } j = 1,..,n\}$

The rules of the game are as follows:

1. A token is placed on $f_{ik}$ ($h_{kj}$) when $a_{ik}$ ($b_{kj}$) is inserted
   into the machine.

2. Updating $c_{ij}$ (by adding $a_{ik}b_{kj}$ to $c_{ij}$ for some k) results
   in removing the edge $\langle f_{ik}, h_{kj} \rangle$ from $G_k$.

3. An edge is removable only if there are tokens at both end vertices.

4. A token from a vertex is removable only if all the edges incident on the
   vertex are removable. When a token from a vertex is removed then all the
   incident edges on the vertex are deleted. (The token will eventually leave
   the machine and will never reenter.)

Footnote 2: $f(n) = \Omega(n^2)$ if there exists a positive constant c for which
$f(n) \ge cn^2$.

We will assume that each token occupies unit storage ($O(1)$). We also assume
that a partially updated $c_{ij}$ also occupies unit storage. (At any instant of
time $c_{ij}$ is partially updated if there exists some k ($1 \le k \le n$) such
that $a_{ik}b_{kj}$ either has not been computed and/or added to $c_{ij}$ by
that time instant.)

Let $x_k$ be the earliest time at which the first token in $G_k$ is removable
and let $y_k$ be the earliest time at which all the tokens in $G_k$ are
removable. Since only a constant number of tokens enter the machine at any time,
by choosing n sufficiently large, we can ensure that
$\forall k$ ($1 \le k \le n$) $x_k < y_k$. $\forall k$ ($1 \le k \le n$), let
$I_k = [x_k, y_k]$ denote the time interval between and including $x_k$ and
$y_k$.

Lemma 5.1: At any time t such that $x_k \le t < y_k$, there are at least n
tokens in $G_k$.

Proof: Without any loss of generality, let the first (or one of the first, if
there is more than one) token that can be removed from $G_k$ be the one on
vertex $f_{mk}$. By rules 3 and 4, at $t = x_k$ there must then be tokens on all
$h_{kj}$ ($1 \le j \le n$). We claim that no token on any $h_{kj}$ will be
removable at any t ($x_k \le t < y_k$).

Assume this is not the case, and at $t < y_k$ let $h_{ks}$ be the first vertex
(or one of the first vertices) from which a token is removable. This implies
that there must be tokens on all vertices $f_{ik}$ that still have incident
edges. This means that all the edges still remaining in $G_k$ are removable, and
consequently all the remaining tokens in $G_k$ are removable at time t. But then
$t = y_k$, a contradiction. Hence no token on any $h_{kj}$ is removable at any
time t ($x_k \le t < y_k$). Each $h_{kj}$ has a token, and hence the Lemma. □


Lemma 5.2: Let $m < n$. For any i, if $t \ge y_i$ and $G_i$ has m tokens then at
least $n^2/2$ edges must have been deleted from $G_i$.

Proof: There are m tokens in $G_i$. Since $t \ge y_i$, the absence of a token on
a vertex means that all the n edges incident on the vertex have been deleted.
(At $t = y_i$, all edges in $G_i$ are removable.) The number of absent tokens is
$2n-m$, which is greater than n as $m < n$. Now one edge is in common with at
most two vertices, so the more than n absent tokens, each accounting for n
deleted edges, count every deleted edge at most twice. Thus the $2n-m$ absent
tokens result in at least $n^2/2$ deleted edges. □
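The counting bound of Lemma 5.2 can be checked by brute force on the complete bipartite graph with n vertices on each side. The sketch assumes, as in the proof, that an edge can survive only if both of its end vertices still hold tokens, so with `kept_f` tokens on the f side and `kept_h` on the h side at most `kept_f * kept_h` edges remain. The function name is hypothetical.

```python
def min_deleted_edges(n, m):
    """Fewest edges that must have been deleted from G_i (n vertices per side)
    when only m < n tokens remain, over all ways of splitting the m tokens
    between the two sides."""
    best = None
    for kept_f in range(0, m + 1):
        kept_h = m - kept_f
        surviving = kept_f * kept_h          # upper bound on edges still present
        deleted = n * n - surviving
        if best is None or deleted < best:
            best = deleted
    return best
```

For every n and every m < n this minimum is at least n*n/2; in fact it is larger, since the surviving edges number at most m*m/4.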

Let us impose an ordering on the sets $I_k$ such that
$x_{i_1} \le x_{i_2} \le .. \le x_{i_n}$, and let
$\Gamma = \{I_k \mid y_k \le x_{i_n}\}$ and
$\Delta = \{I_k \mid y_k > x_{i_n}\}$.

Theorem 5.1: Any matrix-multiplication machine requires $\Omega(n^2)$ storage.

Proof: Since $|\Gamma|+|\Delta| = n$, either $|\Gamma| \ge n/2$ or
$|\Delta| \ge n/2$.

Case 1: $|\Delta| \ge n/2$ (see Figure 5.1)

Figure 5.1

At $t = x_{i_n}$ all the intervals in $\Delta$ satisfy Lemma 5.1. Hence at
$t = x_{i_n}$, there are at least $n(n/2)$ tokens in the machine. So the storage
required is $\Omega(n^2)$.

Case 2: $|\Gamma| \ge n/2$ (see Figure 5.2)

Figure 5.2

At $t = x_{i_n}$, either all $G_k$ such that $I_k \in \Gamma$ have n tokens on
them, or at least one of them has fewer than n tokens. If every such $G_k$ has n
tokens then the storage required is again $\Omega(n^2)$. If any one, say $G_r$,
has fewer than n tokens then by Lemma 5.2 $G_r$ must have released at least
$n^2/2$ edges. Now each released edge corresponds to a partially updated
$c_{ij}$. None of these $c_{ij}$'s could have left the machine, as all of them
are finally updated only at $t \ge x_{i_n}$. Thus at any time t
($y_r < t < x_{i_n}$) there are at least $n^2/2$ partially updated $c_{ij}$'s in
the machine. The case $y_r = x_{i_n}$ is covered by assumption 2, which
precludes the possibility of all these $c_{ij}$'s being instantaneously updated
and leaving the machine. So the storage required for the partially updated
$c_{ij}$'s must be $\Omega(n^2)$. □

Theorem 5.2: The $O(n^2)$ cells used by the modular linear-array algorithm are
optimal.

Proof: From Theorem 5.1 it follows that the modular linear-array algorithm
requires $\Omega(n^2)$ storage. Now each cell in the linear array has constant
storage, and hence the Theorem. □

Conclusion 

We have described a novel linear-array matrix multiplication algorithm that
uses an asymptotically optimal number of cells. The cells used in the array are
simple, requiring a constant amount of local storage that is independent of the
sizes of the matrices being multiplied. The cells can be built using
off-the-shelf components. The array can be modularly expanded to accommodate
arbitrary matrix sizes by adding more of these simple cells.